Facility for highlighting documents accessed through search or browsing

ABSTRACT

An information highlighting facility assists the user in evaluating relevance of accessed documents to the user's information need. The accessed documents may, for example, be identified by a search engine in response to a user query. When accessing documents identified as relevant by a search engine from other networked computers, the facility provides information highlighting to assist the user in determining whether the document is relevant. A model of the user's interest, which may include an augmented set of search terms, is used to take into account the general interest of the user as captured by an interest profile and context of use of the computer by the user, or a combination thereof. The model of the user's interest is applied to the document text as the document is accessed from its source. The highlighting of information about the document content may include highlighting of the terminology in the text, scrolling of the document to the relevant passages, identification of entity names and entity relations, creation of a document summary and a document thumbnail, etc. In addition, the model can be applied to a set of documents accessed by the user, e.g., to re-rank the top scoring documents from the result set provided to the user by a search engine or some other information providing services.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional of U.S. patent application Ser. No. 09/578,302, filed on May 25, 2000, and entitled, “FACILITY FOR HIGHLIGHTING DOCUMENTS ACCESSED THROUGH SEARCH OR BROWSING.” This application is also related to co-pending U.S. Continuation patent application Ser. No. ______, filed ______, entitled, “FACILITY FOR HIGHLIGHTING DOCUMENTS ACCESSED THROUGH SEARCH OR BROWSING”, (Atty. Docket No. MS 131774.02/MSFTP240USA) and co-pending U.S. Divisional Patent Application Ser. No. ______, filed ______, and entitled, “FACILITY FOR HIGHLIGHTING DOCUMENTS ACCESSED THROUGH SEARCH OR BROWSING”, (Atty. Docket No. MS 131744.04/MSFTP240USC). The entireties of the aforementioned applications are incorporated herein by reference.

FIELD OF THE INVENTION

This invention relates generally to the field of computers, and in particular to enhancing query results provided by a search engine.

COPYRIGHT NOTICE/PERMISSION

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawing hereto: Copyright © 2000, Microsoft Corporation, All Rights Reserved.

BACKGROUND

The World Wide Web (WWW), often referred to as the Web, is a fast growing network that involves a vast quantity of data and numerous types of services aimed at accessing, organizing, and distributing that data. In particular, there are millions of documents on the Web and many on-line search services that enable the users to find documents that are of interest to them.

Furthermore, documents on the Web are linked via hyperlinks, created by the authors of the documents, which enable the users to browse through documents on their own by following the links that interest them.

The large quantity of the Web data and the fast rate of Web expansion have immanent implications on the ways the services on the Web can approach the problem of processing Web data.

Collecting and processing all or a majority of Web documents with an appropriate rate of updating the information that has been collected about these documents is often not feasible. Indeed, the processing power and the network bandwidth are not yet up to the task. However, there is also a more fundamental reason: because of the distributed nature of the data, the services are not in control of the document change: the authors of Web documents can change them at any time, as needed. That is why, among other reasons, search engines do not deliver the document text in response to the user's query. The search engines at best deliver the title and some type of summary of a document that is created by the search engine based on the version of the document available at the time the document was collected and indexed. The search engine points the user to the URL, i.e., the location of the document on the Web at the time the document was collected. It is up to the user then to execute the URL link and access the document text, which may or may not be the same as the text processed and summarized by the search engine.

This lack of control over the content of documents on the Web requires new approaches in providing some of the basic and commonly provided document management features of traditional document management systems. Such features include: marking of the query terminology in the document text to help the user identify the portions of the text that talk about the desired topic, to assess the document relevance to the topic, etc.; summarizing document text to extract most salient sentences or query specific portions of the text; analyzing the text to identify and extract entities that may be of particular interest to the user, e.g., person names, company names, locations, etc., or relations among these entities; creating various visual representations of the document to help with browsing through the document, assessing document relevance, etc.

Since the documents on the Web are frequently accessed in the browsing mode by following the hyperlinks in the documents, the same type of document management support is needed for browsing among and through Web documents.

Furthermore, since the type and the quality of services on the Web vary, the users on the Web often need to explore which of them can best handle a particular request for information. For example, if the user is engaging a couple of search engines to find certain types of documents, this often involves retyping the query in the appropriate search window of the individual search engines. There is a need for a facility that can assist the user in specifying the user's information need and that creates various representations of that need suitable for interfacing with various Web services.

In summary, there is a need to provide the user with the facilities for obtaining better information regarding the relevancy of documents pointed to by various services on the Web or accessed by browsing the Web documents. There is a further need to provide such information based on the current versions of the documents. There is still a further need to provide the user with a consistent manner in which such relevancy is identified regardless of the way the document is accessed (based on Web service information or browsing or a combination thereof). There is yet a further need to provide a rich representation of the user's information need.

SUMMARY OF THE INVENTION

An information highlighting facility on a computer assists the user in searching, browsing, and reading documents on the Web or similar distributed network environments. When the user downloads a document from the Web, e.g., by following a hyperlink while browsing the Web or by choosing one of the documents that a search engine (or some other Web service) found relevant to a previously issued query, the information highlighting facility provides information to assist the user in determining whether the document is of interest to the user. The facility matches the document text with a model of the user's information need that has been created by the facility (independently from the services that the user is using on the Web) and supports a number of document analyses.

In the case of search, the document text is analyzed with respect to the user's specified information need. In this instance, the assistance in assessing the document relevance may be provided by marking keywords or key phrases within documents to make them easier to spot, by scrolling to what seems to be the most relevant portion of the document, etc., or by combinations thereof. Additional assistance can be provided by extracting specified features from the document such as company names, person names, location names, etc., by summarizing documents in view of the user's query, by constructing thumbnail images of documents with query terms highlighted, etc. Furthermore, the facility can provide alternative ranking of documents pointed to by the search engine on the basis of the richer representation of the user's need that the facility created. That can be achieved by pre-fetching, analyzing, and re-ranking a selection of documents that were originally pointed to by the search engine.

In the case of browsing, for example, the user can specify in advance, or at the time the document is accessed, a perspective from which the user wants the document to be analyzed. For example, the user can provide the information highlighting facility with a description of the topic the user is interested in or other criteria for analyzing documents (e.g., a format specification of the document). This description of the user's preferences can be applied to analyze the accessed documents (currently and subsequently) as well as used to give a relevance assessment of the documents pointed to by the hyperlinks in the currently viewed document. Relevance assessment of hyperlinks could be achieved, for example, by downloading and analyzing the linked documents in the background and providing the user with the qualitative characterization of the links.

To assist the user in reading and assessing the documents, the information highlighting facility creates a description or a model of the user's need or interest. This model is used as the basis for various document analyses. The model may include, but is not limited to, descriptions of queries that the user is sending to search engines on the Web, a general ‘profile of interest’ that the user specified (e.g., by means of a dialog), the augmented versions of these descriptions that the highlighting facility created based on further linguistic and/or semantic analysis, or additional information that the highlighting facility may collect or infer about the user's current task. The user may also request some generic types of analysis to be applied, e.g., extraction of certain types of entity names or entity relations that may be contained in the document. This model of the user interest serves as a context for the analysis of the accessed or pre-fetched documents.

The processing required for the construction of the model can be done locally using facilities on the user's computer, or as an external service (e.g., at a dedicated server on the network), or as a combination of the two. Furthermore, the model construction can be done simultaneously and independently from the other services that the user is using on the Web (e.g., search engines).

The information highlighting facility applies the model to the documents that are accessed by the user, or if required for some types of analysis, by pre-fetching the documents in the background. The results of the various analyses are presented appropriately (by inserting markups in the document, extracting information into separate windows, or creating various other visual representations).

The facility also provides support for managing various user interest models and for the user to apply them as needed, both for document analysis and for interfacing with other Web services (e.g., the user can simply point to the portion of the model representation that needs to be sent to a particular Search service as a query).

The principles on which the information highlighting facility is built allow for incorporation of various types of document analysis. For example, it can include but is not limited to: terminology marking, scrolling, re-ranking, document thumbnailing, summarization and link analysis.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computer system on which the present invention may be implemented.

FIG. 2A is a block flow diagram showing interaction of the present invention with a Web based information service (e.g., a search engine) and browser.

FIG. 2B is a block flow diagram of a service for creating a model of the user's interest and management of documents and document requests.

FIG. 3 is a flow diagram showing the flow of creation of a context and its application to documents to provide highlighting.

FIG. 4 is a block diagram showing components involved in providing augmented search terms and highlighting.

FIG. 5 is a flow diagram showing scrolling of a document to its most relevant portion.

FIG. 6 is a flow diagram showing re-ranking of documents provided by a search engine.

FIG. 7 is a flow diagram showing the identification and provision of a list of names associated with a document.

FIG. 8 is a flow diagram showing the creation of a thumbnail of a document with highlighting.

FIG. 9 is a flow diagram showing the creation of a summary of a document.

DETAILED DESCRIPTION

In the following detailed description of exemplary embodiments of the invention, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific exemplary embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, mechanical, electrical and other changes may be made without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.

The detailed description is divided into multiple sections. A first section describes the operation of a computer system which implements the current invention. This is followed by a high level description of the invention, including how the model of the user's interest is generated and used. Further embodiments are then described, including re-ranking of documents and extracting and generating information from the documents to further assist the user in reading and assessing the accessed documents.

Hardware and Operating Environment

FIG. 1 provides a brief, general description of a suitable computing environment in which the invention may be implemented. The invention will hereinafter be described in the general context of computer-executable program modules containing instructions executed by a personal computer (PC). Program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Those skilled in the art will appreciate that the invention may be practiced with other computer-system configurations, including hand-held devices, multiprocessor systems, microprocessor-based programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like which have multimedia capabilities. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

FIG. 1 shows a general-purpose computing device in the form of a conventional personal computer 20, which includes processing unit 21, system memory 22, and system bus 23 that couples the system memory and other system components to processing unit 21. System bus 23 may be any of several types, including a memory bus or memory controller, a peripheral bus, and a local bus, and may use any of a variety of bus structures. System memory 22 includes read-only memory (ROM) 24 and random-access memory (RAM) 25. A basic input/output system (BIOS) 26, stored in ROM 24, contains the basic routines that transfer information between components of personal computer 20. BIOS 26 also contains start-up routines for the system. Personal computer 20 further includes hard disk drive 27 for reading from and writing to a hard disk (not shown), magnetic disk drive 28 for reading from and writing to a removable magnetic disk 29, and optical disk drive 30 for reading from and writing to a removable optical disk 31 such as a CD-ROM or other optical medium. Hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to system bus 23 by a hard-disk drive interface 32, a magnetic-disk drive interface 33, and an optical-drive interface 34, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for personal computer 20. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, those skilled in the art will appreciate that other types of computer-readable media which can store data accessible by a computer may also be used in the exemplary operating environment. Such media may include magnetic cassettes, flash-memory cards, digital versatile disks, Bernoulli cartridges, RAMs, ROMs, and the like.

Program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 and RAM 25. Program modules may include operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into personal computer 20 through input devices such as a keyboard 40 and a pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial-port interface 46 coupled to system bus 23; but they may be connected through other interfaces not shown in FIG. 1, such as a parallel port, a game port, or a universal serial bus (USB). A monitor 47 or other display device also connects to system bus 23 via an interface such as a video adapter 48. In addition to the monitor, personal computers typically include other peripheral output devices (not shown) such as speakers and printers.

Personal computer 20 may operate in a networked environment using logical connections to one or more remote computers such as remote computer 49. Remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device, or other common network node. It typically includes many or all of the components described above in connection with personal computer 20; however, only a storage device 50 is illustrated in FIG. 1. The logical connections depicted in FIG. 1 include local-area network (LAN) 51 and a wide-area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When placed in a LAN networking environment, PC 20 connects to local network 51 through a network interface or adapter 53. When used in a WAN networking environment such as the Internet, PC 20 typically includes modem 54 or other means for establishing communications over network 52. Modem 54 may be internal or external to PC 20, and connects to system bus 23 via serial-port interface 46. In a networked environment, program modules, such as those comprising Microsoft® Word which are depicted as residing within PC 20, or portions thereof, may be stored in remote storage device 50. Of course, the network connections shown are illustrative, and other means of establishing a communications link between the computers may be substituted.

Software may be designed using many different methods, including object oriented programming methods. C++ and Java are two examples of common object oriented computer programming languages that provide functionality associated with object-oriented programming. Object oriented programming methods provide a means to encapsulate data members (variables) and member functions (methods) that operate on that data into a single entity called a class. Object oriented programming methods also provide a means to create new classes based on existing classes.

An object is an instance of a class. The data members of an object are attributes that are stored inside the computer memory, and the methods are executable computer code that act upon this data, along with potentially providing other services. The notion of an object is exploited in the present invention in that certain aspects of the invention are implemented as objects in one embodiment.

An interface is a group of related functions that are organized into a named unit. Each interface may be uniquely identified by some identifier. Interfaces have no instantiation, that is, an interface is a definition only without the executable code needed to implement the methods which are specified by the interface. An object may support an interface by providing executable code for the methods specified by the interface. The executable code supplied by the object must comply with the definitions specified by the interface. The object may also provide additional methods. Those skilled in the art will recognize that interfaces are not limited to use in or by an object oriented programming environment.

Invention Overview

A block flow diagram of operation of the invention is shown in FIG. 2A generally at 200. An information highlighting facility is designated as information highlighting facility 210, as shown in FIGS. 2A and 2B. The term highlighting facility refers to multiple functions used to highlight the relevancy of one or more documents as described below. It is not meant to be a term that refers only to the common function of highlighting text. The information highlighting facility also includes a document analysis facility to analyze documents prior to applying highlighting functions.

A user's information need is represented at 205 in FIG. 2A. The need is communicated to a means of accessing the web, such as a web browser 208, and to an information highlighting facility 210. The information highlighting facility 210 creates a model of the user's information need that is more or less independent of the expression of the user's information need that is communicated by the user to a particular information providing service 212 (e.g., search engines on the Web). The information providing service 212 also comprises an index 213 that identifies documents 214 by means of an address or URL from which a web browser 217 may retrieve and display documents. Documents may also be provided directly to the information highlighting facility 210.

Input to the information highlighting facility 210 can be, for example, a single query or a set of queries 215 communicated by the user to the Web information providing service 212 (e.g., queries to a Search engine). These queries are in one embodiment captured from the Web page of a search engine at the time the user types a query into the search box provided by a user interface 216. This is referred to as an implicit characterization of the user's information need since it was not directly communicated to the information highlighting facility 210, but rather captured by the information highlighting facility 210 by monitoring the user's actions. Similarly, the system used by the user can monitor the user's actions and provide information on the task the user is performing 218 (e.g., working on a report, sending an e-mail message, etc.) as a context for the information highlighting facility analysis to create the model of the user's information need.

In another embodiment the information highlighting facility provides a query box that serves the purpose of specifying the query. The specified query is then sent (copied and pasted, dragged and dropped) to the search box 216 of a desired search engine. The user is then not required to retype the query when changing from one search engine to another.
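By way of a non-limiting illustration, the reuse of one specified query across several search services could be sketched as follows. The service names, endpoint addresses, and query parameter names below are hypothetical placeholders chosen for the example and are not part of the described facility; the sketch simply formats the same query text for each service so that the user need not retype it.

```python
from urllib.parse import urlencode

# Hypothetical services: the endpoints and parameter names are placeholders,
# not real search engines or anything named in this description.
SERVICES = {
    "engine_a": ("https://search-a.example/search", "q"),
    "engine_b": ("https://search-b.example/find", "query"),
}


def build_request_urls(query, services=SERVICES):
    """Return one request URL per service for the same query text."""
    return {
        name: endpoint + "?" + urlencode({param: query})
        for name, (endpoint, param) in services.items()
    }


if __name__ == "__main__":
    for name, url in build_request_urls("car price").items():
        print(name, url)
```

Keeping the query in one place and deriving each service-specific form from it corresponds to the stated goal that the user is not required to retype the query when changing from one search engine to another.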

Another, more explicit way of providing information highlighting facility 210 with the characterization of the user's need is by using a user's specification of the task and intentions at 218 (for example, in a form of a dialogue with information highlighting facility 210) and/or the user's detailed description of the information need at 220 (a direct input to information highlighting facility 210). Note that parts or all of the full description of the user's need are then useable for communicating with a particular information providing service (e.g., a search engine or information directory on the Web).

Information highlighting facility 210 is provided with a GUI 222 (graphical user interface) that enables direct input from the user. In particular, the user may specify a desired type of information highlighting facility 210 analysis that should be applied to the viewed documents, with details on the parameters to be used in the analysis (when required) and preferences on the display of results as indicated at 223. Furthermore, the user may provide information on a particular task the user is currently performing as represented at 224 to ensure that the analyses are context sensitive when applicable.

Information highlighting facility 210 contains a module 225 for managing past requests for information analysis (e.g., storing, retrieving, concatenating queries and information need descriptions) and/or documents that have been downloaded and analyzed.

Information highlighting facility 210 analyses typically involve three components: format recognition and analysis module 227, content analyses 228 (e.g., linguistic and statistical analysis of the text), and resources 229 required for the analyses (e.g., linguistic and knowledge resources for identifying company names in the text).

The user specifies the information need 205 to information highlighting facility 210 directly or indirectly by communicating it to the Web information providing service 212. The system or the user may also provide information on a task that the user is currently performing. The user also specifies the type of information highlighting facility analysis that should be performed on the accessed documents.

This request for information is communicated via Web browser 217 to the information providing service. As a result, the user is provided with URLs and perhaps some additional information about documents that potentially satisfy the user's information need. For example, in the case of Web search engines, the result of a search is typically a ranked list of document titles with short summaries and URLs.

Based on the task context 224 and the specification of the user's information need, information highlighting facility 210 creates a model of the user's information need represented at 232.

FIG. 2B provides further information about process flow of the invention. The numbering of modules is consistent with FIG. 2A. Information highlighting facility 210 provides several features to enhance or highlight documents as indicated at 240. Such features may include terminology highlighting, document scrolling, entity extraction and relation finding, hyperlink analysis, document relevance ranking, document thumbnails, and document summarization.

As an example of the process flow, if the user desires to have relevant terminology from the information request highlighted in the accessed documents, information highlighting facility 210 processes the request for information using linguistic analysis tools 228 and knowledge resources 229 to create a rich model 232 of the topic of interest. For example, it may perform synonym expansion of the original terms in the information request to ensure that relevant information is highlighted in the document without the need for the user to try to anticipate the linguistic variations in which the topic is described in the text.
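By way of a non-limiting illustration, synonym expansion of the original terms could be sketched as follows. The synonym table, the function name, and the dictionary layout of the resulting model are assumptions made for the example; an actual embodiment would draw on the linguistic and knowledge resources described above rather than a hard-coded table.

```python
# Hypothetical synonym table standing in for a thesaurus or knowledge resource.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "price": ["cost"],
}


def build_interest_model(query, synonyms=SYNONYMS):
    """Return the original query terms together with their synonym expansions.

    The model keeps the two groups apart so that later analyses can treat
    original terms and added synonyms differently.
    """
    original = [term.lower() for term in query.split()]
    expanded = []
    for term in original:
        for synonym in synonyms.get(term, []):
            # Skip anything the user already typed or that was already added.
            if synonym not in original and synonym not in expanded:
                expanded.append(synonym)
    return {"original": original, "synonyms": expanded}


if __name__ == "__main__":
    print(build_interest_model("Car price"))
```

Keeping the original terms separate from the expansions is a design choice of this sketch; it allows, for instance, the two groups to be displayed or weighted differently.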

As the user accesses a document, the model of the user's information need is used in the analysis of the document. For example, terminology highlighting is achieved by detecting in the document text (e.g., pattern matching) the terminology from the rich linguistic representation of the user's information need created by information highlighting facility 210. The user can specify various parameters related to terminology highlighting at 223. For example, the user may prefer to have terminology from the original description of the information need highlighted in one color while all the synonyms in some other color. Or, perhaps, the user may want only the occurrence of multi-word phrases from the request highlighted in the document, etc.
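By way of a non-limiting illustration, terminology highlighting by pattern matching, with one color for original terms and another for synonyms, could be sketched as follows. The function name, the default colors, and the use of HTML span markup for the output are assumptions made for the example and are not prescribed by this description.

```python
import html
import re


def highlight(text, original_terms, synonym_terms,
              original_color="yellow", synonym_color="lightgreen"):
    """Wrap matched terms in colored span markup; escape everything else."""
    colors = {term.lower(): original_color for term in original_terms}
    for term in synonym_terms:
        # An original term keeps its color if it is also listed as a synonym.
        colors.setdefault(term.lower(), synonym_color)
    if not colors:
        return html.escape(text)
    # Try longer terms first so multi-word phrases win over their sub-words.
    alternatives = sorted(colors, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in alternatives) + r")\b",
        re.IGNORECASE,
    )
    pieces, position = [], 0
    for match in pattern.finditer(text):
        pieces.append(html.escape(text[position:match.start()]))
        color = colors.get(match.group(0).lower(), synonym_color)
        pieces.append('<span style="background:%s">%s</span>'
                      % (color, html.escape(match.group(0))))
        position = match.end()
    pieces.append(html.escape(text[position:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(highlight("Car prices and automobile cost", ["car"], ["automobile"]))
```

The sketch matches whole words only and does not handle inflected forms such as plurals; a fuller linguistic analysis would be needed for that.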

Some types of information highlighting facility analysis may require pre-fetching the document text in the background as the user is performing other tasks, e.g., viewing the result list from the search engine. For example, suppose that the user requested that thumbnail images of documents that were indicated by the search engine be displayed with query terminology highlighted in them. In that case, the text of documents from the search result page being viewed by the user could be downloaded in the background as represented by communication line 245, analyzed for query terminology and document layout and the highlighted thumbnail images would be displayed.

Similarly, suppose that the user requested an alternative ranking of the search result based on the rich information highlighting facility representation of the user's need (as opposed to the short query that the user may have communicated to the search engine). The document text of some selected documents (e.g., top N ranked documents) could be pre-fetched in the background, linguistically and statistically processed, and compared with the information highlighting facility 210 model of the user's interest. The documents would be scored and an alternative ranking of them presented to the user.
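By way of a non-limiting illustration, re-ranking of the top N results against a richer model could be sketched as follows. The scoring formula (term frequency normalized by document length, with synonyms counted at half weight), the dictionary layout of the model, and the function names are assumptions made for the example; this description does not prescribe a particular scoring method. The pre-fetched text is passed in as a mapping from URL to text, so no network access is performed here, and only single-word terms are counted.

```python
import re


def score_document(text, model, synonym_weight=0.5):
    """Score text against a model with 'original' and 'synonyms' term lists."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    score = sum(counts.get(term, 0) for term in model["original"])
    score += synonym_weight * sum(counts.get(term, 0) for term in model["synonyms"])
    # Normalize by length so long documents are not favored automatically.
    return score / len(words)


def rerank(result_urls, fetched_text, model, top_n=10):
    """Re-order the first top_n URLs by score; leave the remainder as ranked."""
    head, tail = result_urls[:top_n], result_urls[top_n:]
    reranked = sorted(
        head,
        key=lambda url: score_document(fetched_text.get(url, ""), model),
        reverse=True,
    )
    return reranked + tail


if __name__ == "__main__":
    model = {"original": ["car", "price"], "synonyms": ["automobile"]}
    texts = {"u1": "weather report", "u2": "car car price", "u3": "automobile review"}
    print(rerank(["u1", "u2", "u3"], texts, model))
```

Because the sort is stable, documents with equal scores keep the order in which the search engine originally ranked them.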

Many of the information highlighting facility 210 analyses could be equally applied to the documents accessed as the user is browsing through the documents.

Information highlighting facility 210 may be implemented as a local service on the user's desktop or as a remote service, or can be a combination of the two. For example, some information highlighting facility 210 analyses could employ the locally available resources (e.g., thesauri or knowledge base that the user may have available locally).

When applied as a remote service (and thus used by a number of users), information highlighting facility 210 could benefit from the information it may store on the user community. For example, it may store some types of analysis of documents that have been performed as a result of the users' requests within a certain period of time (e.g., an hour, or a day, etc.).

For example, if a user A requested that the accessed documents be analyzed for company names and person names, information highlighting facility 210 can perform this analysis and store the analysis results. When a user B accesses the same document and asks for the same analysis, the results could be delivered without repeating the document analysis (thus saving processing time).
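By way of a non-limiting illustration, storing analysis results for reuse across users within a limited period of time could be sketched as follows. The class name, the keying of entries by a hash of the document text together with the analysis name, and the injectable clock are assumptions made for the example.

```python
import hashlib
import time


class AnalysisCache:
    """Keep analysis results for a limited time, keyed by content and analysis."""

    def __init__(self, ttl_seconds=3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be exercised in tests
        self._entries = {}       # (content hash, analysis name) -> (time, result)

    def get_or_compute(self, document_text, analysis_name, analyze):
        """Return a stored result if still fresh; otherwise run analyze()."""
        digest = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
        key = (digest, analysis_name)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        result = analyze(document_text)
        self._entries[key] = (now, result)
        return result


if __name__ == "__main__":
    cache = AnalysisCache(ttl_seconds=3600)
    find_names = lambda text: [w for w in text.split() if w.istitle()]
    print(cache.get_or_compute("Report on Acme by Smith", "names", find_names))
    # A second request for the same text and analysis is served from the cache.
    print(cache.get_or_compute("Report on Acme by Smith", "names", find_names))
```

Keying on the document text rather than on its URL means that a stored result is reused only when the currently accessed version of the document is unchanged, which is a design choice of this sketch.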

As indicated above, information highlighting facility 210 captures information about the user's need. This can be done, in one embodiment, based on the queries that the user issues to the Web Search engines or different Web services at the service Web site. It can also be based on the user's description of the user's interest or information need communicated directly to information highlighting facility 210 through the information highlighting facility interface 222. Furthermore, the information highlighting facility 210 may make inferences or collect from the user explicitly (e.g., through a dialog) information about the user's task or intentions or preferences about the characteristics of documents (e.g., format of the documents that the user wants to access or avoid) or similar.

Based on the collected information, the information highlighting facility 210 builds the representation or model of the user's interest. This model then provides a context for analysis and information highlighting of any document accessed by the user. In one embodiment these are the documents downloaded from the Web. However, information highlighting facility 210 can be extended with components that recognize formats of documents from various sources (e.g., documents created by applications running locally on the user's desktop, documents delivered via e-mail, etc.). All information highlighting facility 210 features could then be applied to the content of those documents and the results displayed appropriately.

Users may access documents by directly executing a URL of the desired document via the browser 217, may follow a hyperlink in the currently viewed document, or may select to access documents from a list of URLs presented to the user by a Web service (Search or others) as a result of the user's request for information.

As the documents are downloaded by the browser 217, they are processed by the information highlighting facility 210 in view of the model of the user's interest. The results of the information highlighting facility 210 processing are then displayed appropriately to the user. Information highlighting facility 210 may include a number of different features and supporting analyses comprising but not limited to: marking of terminology in the text; scrolling to the relevant passages in the document; extracting specified entity names and relations among entities in the text; summarizing documents by selecting sentences salient to the content of the document, or related to the query, etc.; ranking documents in a designated document set with respect to the information highlighting facility 210 representation of the user's need; analyzing hyperlinks in the viewed documents with respect to the user's need; and creating various visual representations of the documents, such as thumbnail document images with highlighted information in the document text and hyperlinks to support reading of and browsing through the document text.

The information highlighting facility 210 provides support for storing and managing various models of the user's interests. In particular, it enables the user to select which of the existing models, or which combination of the existing models, should be used as the context for the analysis of documents.

If the user wishes to engage Search or similar Web services for information seeking, the user's queries or parts of the comprehensive information highlighting facility 210 model of the user's interest 232 are sent via browser 217, such as Internet Explorer, for processing by the service 212. The user interface 216 running on the service end receives queries and performs the search operation over the documents that have been collected and processed by the service. Typically, the services store information about the documents, including the document URL (uniform resource locator), in the form of index 213. As a result of the query processing, document identifiers, such as URLs, are retrieved from the index and typically ranked by relevance to the queries. The URLs are sent back to the client.

In one embodiment, the user's interest model is generated by analyzing the query terms as entered by the user in 216. This may involve creating an augmented set of search terms based on syntactic analysis and semantic expansion of the user's query. The information highlighting facility 210 then provides highlighting of the original and expanded query terminology in the documents accessed upon the user request (via document identifier, the URL). Furthermore, the information highlighting facility 210 may use information about the wider context, e.g., the user task or user's explicit preferences, to perform the terminology highlighting appropriately. For example, to support more efficient reading of the document, information highlighting facility 210 may perform selective terminology highlighting in the text by highlighting only key concepts from the user's interest model in the paragraphs that are assessed as most relevant to the user's need.

In one embodiment, the information highlighting facility 210 receives the list of URLs from the Search engine or other Web service and begins to download documents 214 identified via browser 217 in the background (while the user is performing other tasks, like reading the result list, etc.) in order to perform the linguistic and statistical analysis of the document texts. MS Read then re-ranks the documents with respect to their relevance to the user's interest model, a more comprehensive representation of the user's interest than the one presented by the user to the Search or some other Web service 212.

In one embodiment, information highlighting facility 210 performs document analysis without a need for downloading and analyzing the document text in advance or in the background. This is done based on simple text analysis that adds no significant processing overhead beyond the time required to download and display the document. In still a further embodiment, other document analysis can be performed in the background as represented by line 245. This analysis may be more involved and require each document to be downloaded. Both approaches can be used to support entity extraction and relation finding, document summarization, etc.

When the user engages in browsing through Web documents, the user can either specify an existing context, i.e., a model of the user's interest or need that information highlighting facility 210 created previously, or can initiate creation of a new one by providing information to the information highlighting facility 210 in various forms, including but not limited to a description of a particular topic interest, preferences, intentions and purpose of the browsing task, etc. Information highlighting facility 210 then creates the appropriate user's interest model as described above and applies it to the documents as the user browses the Web. In one embodiment, the information highlighting facility 210 downloads in the background the documents that are pointed to by the hyperlinks in the currently viewed document. These documents are analyzed with respect to the current model of the user's interest. The result of the analysis is information to the user about the relevance of the hyperlinks and suggestions for further steps in browsing. In other embodiments, the hyperlink analysis is performed by the information highlighting facility 210 based on the text in the current document that surrounds the hyperlinks, thus without the need to download the linked documents in the background.

Analyses performed by the information highlighting facility 210 can be performed locally, using the local information resources as needed (linguistic resources such as lexicons, dictionaries, knowledge base, etc.), or remotely, or as a combination of the two. The types of analyses include but are not limited to:

Terminology marking. When a document is downloaded, the terminology describing the user model can be highlighted, for example, by making keywords and key phrases bolder than the surrounding text, or by changing the background color to facilitate easier spotting in the text. In one embodiment, this type of terminology marking can be done at the time the document is downloaded. In another embodiment, a more sophisticated terminology marking is provided by pre-fetching and analyzing the document text in the background (e.g., while the user is performing other tasks, such as reading the document titles in the result sets of the search engines).

Scrolling. When a document is downloaded, it can be scrolled, for example, to the most relevant portion of a multi-page document. This can be done, for example, by statistical and linguistic analysis of the text that involves scoring individual paragraphs or subparts of the document with respect to the user model. Alternatively, it may be based on a simple statistical analysis of the occurrences of terminology from the user's interest model in the text at the time the document is being downloaded, thus with no need for pre-fetching the document text.

Re-ranking. The list of documents provided by one or more search engines may be re-ranked based on relevance ranking and based on a representation of the user's need. The re-ranking may be based on, but is not restricted to, the analysis of information from the summaries provided by the search engines or pre-fetching the document text and performing additional relevance assessment. This analysis may range from simple pattern matching of the document text and the terminology in the user model to deeper linguistic and statistical analyses and relevance scoring of the document texts.

Document Thumbnailing. Based on a downloaded document, a thumbnail image of the document may be created with or without highlighting of various information found in the document text (e.g., the user query term, the expanded model of the user need, most salient sentences in the text, etc.). Links from the thumbnail image to the document text could be provided to enable easy browsing through the document. By providing visual cues, the thumbnail image of a document provides assistance in assessing the relevance of the whole or parts of the document.

Summarization. A summary of the document text can be provided by, but is not restricted to, extracting salient sentences from the text as identified, for example, by pattern matching with the terminology of the user's interest model or by a deeper linguistic and statistical analysis of the document text. In one embodiment, the summaries are generated based on various entities and entity relations found in the text, related to or independent from the current user's interest model.

Link analysis. The internal and external links on a web page can be assessed by, for example, downloading the text of the linked documents in the background and assessing their utility with respect to the user model. Such information may be communicated to the user as an aid in deciding whether or not to follow the links.
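Of the analyses listed above, terminology marking at download time is the simplest to illustrate. The following is a minimal sketch, assuming plain text input and HTML-style bold tags as the marking; the function name `highlight_terms` and the tag parameters are hypothetical and are not taken from the described facility.

```python
import re
from typing import Iterable


def highlight_terms(text: str, terms: Iterable[str],
                    open_tag: str = "<b>", close_tag: str = "</b>") -> str:
    """Wrap every whole-word occurrence of a term in marking tags."""
    # Longest terms first so that key phrases win over their component words.
    ordered = sorted({t for t in terms if t}, key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Keep the matched text as written in the document; only add the tags.
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, text)
```

Because this sketch relies only on pattern matching, it could run as the document is received, without pre-fetching or deeper linguistic analysis.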

In FIG. 3, a terminology highlighting or marking facility, which is one of the features of the information highlighting facility 210, is indicated generally at 310. The terminology highlighting facility consists of a client component 315 (i.e., highlighter) that can be an independent application or part of a browser. The highlighter operates in one of two modes: query mode 320 and profile mode 325. The highlighting facility also consists of an analyzer 330.

In the query mode, when a query is issued, the highlighter captures the query at 335 (such as from the search window on the search engine's web page) as entered by the user and sends it to the analyzer 330 for syntactic analysis and semantic expansion.

Note that instead of capturing the query from the search engine page, the highlighting application can provide a separate window or a search box for typing in the query. That query could then be sent to any search engine. The advantage of this approach is that the user need not retype the query if the user wants to use services of different search engines or other Web services in general.

The query analyzer 330 is a (local or remote) service that takes the query term or any other short description on a topic as input, and returns an augmented set of terms to the client as a result. The query term analysis is completely independent of the actual search and can be processed in parallel while the search engine is processing the query. In one embodiment, the analyzer is implemented as a remote service that accepts terms for analysis via a network connection.
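The input and output of such an analyzer can be sketched as follows. This is an illustration only: the small `RELATED_TERMS` table and the `STOPWORDS` set are hypothetical stand-ins for the lexicons and linguistic resources a real analyzer would consult, and dropping stopwords stands in for syntactic analysis.

```python
from typing import Dict, List

# Hypothetical expansion table; a real analyzer would use lexicons or a knowledge base.
RELATED_TERMS: Dict[str, List[str]] = {
    "car": ["automobile", "vehicle"],
    "price": ["cost"],
}

# Hypothetical list of words that carry no topical content.
STOPWORDS = {"the", "of", "a", "an", "for", "in"}


def analyze_query(query: str) -> List[str]:
    """Return the original content words followed by related (augmented) terms."""
    original = [w for w in query.lower().split() if w not in STOPWORDS]
    augmented = list(original)
    for word in original:
        for related in RELATED_TERMS.get(word, []):
            if related not in augmented:
                augmented.append(related)
    return augmented
```

Since the function needs only the query text, it could run while the search engine is still processing the same query.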

The original query terms and the augmented set of terms together represent the query context as indicated at 355. The system also makes an association between the result page and the query context in order to ensure the original query is used for default highlighting until the user explicitly changes the context. When the user browses the Web within this query context (by choosing one of the links that represents a document found by the search engine), the corresponding terms are highlighted in the accessed document at 360.

Note that there can be any number of active contexts stored in the terminology highlighter. The association between the result page and the original query may be used to enforce the default highlighting of all the documents on the result list. For instance, if a user returns to the result page of a previous query, the terms of that query context will be highlighted if a document is browsed to from the result page. Additionally, terms of one context can be applied to and highlighted within documents of a different query context, and new contexts can be constructed by combining terms of other contexts (for example, the terms of several related queries can be combined or merged to build a new context).

In the profile mode 325, the user can provide (e.g., by means of a dialog box) a description of the topic of interest at 365, which is then analyzed at 330 analogously to the user's query to provide an augmented set of profile terms. This set of profile terms may be created in parallel with other activities that the user may perform and is then used as a basis for highlighting 360 of all subsequent documents that the user accesses, either in real time or as a background task. The model of the user's interest may also be used as a basis for highlighting 360.

In FIG. 4, a block diagram shows components involved in providing augmented search terms and highlighting generally at 410. A user query (in the search mode) or the description of the user's interest (e.g., in the browsing mode) is represented at 415 and is generated by a user for sending to a search engine or providing it to the read system as an interest profile. The query may be created on a search engine page, or may also be created on the client side in a separate window or search box, and then sent to the search engine. User context information is gathered at 420, and comprises an analysis of the tasks that a user is performing, and analysis of other searches or interest profiles that appear to be related. An analysis engine receives the query and context information, and (in one embodiment) uses natural language processing at 430 and semantic expansion at 435 to provide a model of the user's interest, which in one embodiment may be a set of augmented search terms 440 or a user interest profile. Highlighting of text is then performed at 445 based on the model 440, in one embodiment by selecting a bright background color for all terms found in the document. When used to mark or highlight portions of the document, the model provides the ability to better identify text which is more relevant to the actual intent of the user. Several different types of additional highlighting are described with reference to further figures below. In one embodiment, the document text is accessed and analyzed statistically and linguistically. This analysis enables more sophisticated highlighting methods. For example, highlighting of terms that play a role of a subject or object in the query or profile description is more effective for reading a document than highlighting in the document all the concepts that appear in the query or the profile description. Similarly, query and interest profile terms could be highlighted in the document text only if they appear to have a specific linguistic role, e.g., the role of a subject or object.

In FIG. 5, a flow diagram indicated generally at 510 shows scrolling of a document to its most relevant portion based on the analysis of the document text. A next document identified in search results or accessed by browsing is received at 515. Subparts of the document are identified at 520. The subparts may be passages, sentences, lines, or paragraphs, all of a desired length or the length determined based on the distribution of query terms in the text. The subparts may in fact overlap if desired. Each of the subparts is then scored at 525 using one of several well known relevance matching functions with respect to the model of the user's interest. Statistics from any reference corpus can be used for that purpose. The scoring may also be similar to that used by the search engine, but may also include the use of the model to give a better indication of relevancy. Further, a best portion of the document may be identified by combining consecutive paragraph scores or applying another method, such as (in one embodiment) a Hidden Markov Model (well known in the art) to identify the best passage at 530. At 535, the document is scrolled to the most relevant passage as identified above. The most relevant passage may be scrolled to in the actual document, or may be part of a list of passages which are provided with a link at 540 to corresponding documents. This provides a document list showing the most relevant passage of each document to enable the user to determine which document may be most relevant. If the latter, decision block 545 determines whether the document received was the last document in the search results, or selected portion of search results for this function. If not, the next document is received at 515, and its most relevant portion identified. If it was the last document, control is returned at 550.

In one embodiment, the scrolling of the document is based purely on the pattern matching of the document text with the query or model of the user's interest. For example, the document is automatically scrolled to the first occurrence in the text of an important concept in the query or model. Further, the document can be scrolled to the paragraph with the highest density of query terms or the highest correlation with the model of the user's interest. These document scrolling methods do not require accessing and analyzing document text in advance.
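The density-based choice of a scroll target can be sketched as below. The sketch assumes that paragraphs are already separated, that terms are single words, and that density means the fraction of a paragraph's words that are terms of the user's interest model; the function name `best_paragraph_index` is hypothetical.

```python
import re
from typing import Sequence


def best_paragraph_index(paragraphs: Sequence[str], terms: Sequence[str]) -> int:
    """Return the index of the paragraph with the highest density of model terms.

    Returns 0 (the top of the document) when there are no non-empty paragraphs.
    """
    term_set = {t.lower() for t in terms}
    best_index, best_density = 0, -1.0
    for i, paragraph in enumerate(paragraphs):
        words = re.findall(r"\w+", paragraph.lower())
        if not words:
            continue  # skip empty paragraphs
        density = sum(1 for w in words if w in term_set) / len(words)
        if density > best_density:  # strict comparison keeps the earliest on ties
            best_index, best_density = i, density
    return best_index
```

A viewer could then scroll to the returned paragraph. More elaborate scoring, such as combining consecutive paragraph scores, would replace the density computation.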

In FIG. 6, a flow diagram indicated generally at 610 shows re-ranking of a list of documents provided by a search engine or the documents that are linked to the currently viewed document via hyperlinks. In the search mode, the list of documents is received at 615, and the top N documents referred to as best hits by the search engine are accessed from the respective servers at 620, as a background task while the user may be looking at the list, or performing other tasks. N may range from 2 to as many as resource constraints permit. N is 30 in one embodiment. The entire document, or some number (K) of pages of the document, may be used. Each document may then be scored at 625 in its entirety or similarly to the portion scoring as described previously using a relevance matching method. The scoring may be based on the model, including at least augmented search terms and linguistic analysis of the document text. The list of documents is then sorted in accordance with the document scores at 630. An alternative rank of each of the documents can be provided, or a new list of fewer than N documents can be provided. The list is then provided to the user at 635, and control is returned at 640.
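The score-and-sort steps at 625 and 630 can be sketched as follows. This is a simplified illustration under stated assumptions: the document texts are taken to be already downloaded, relevance is approximated by the fraction of words that are model terms (the described facility may use deeper linguistic analysis), and the names `score_document` and `rerank` are hypothetical.

```python
import re
from typing import List, Sequence, Tuple


def score_document(text: str, terms: Sequence[str]) -> float:
    """Assumed relevance measure: fraction of words that are model terms."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    term_set = {t.lower() for t in terms}
    return sum(1 for w in words if w in term_set) / len(words)


def rerank(results: Sequence[Tuple[str, str]], terms: Sequence[str],
           top_n: int = 30) -> List[Tuple[str, float]]:
    """Score the top N (url, text) results and sort them by descending score."""
    head = results[:top_n]  # only the search engine's best hits are considered
    scored = [(url, score_document(text, terms)) for url, text in head]
    # list.sort is stable, so equally scored documents keep the engine's order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

The default of 30 for `top_n` mirrors the value of N given for one embodiment.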

In the browsing mode, the list of documents received at 615 represents all the documents linked to the currently viewed document. The documents are accessed from the respective servers at 620 in the background and scored at 625 for relevance with respect to the model of the user's interest that the current document may be associated with. The resulting score for each linked document is then displayed in relation to the document link on the current page and serves as a guide for following the links if desired.

In FIG. 7, a flow diagram indicated generally at 710 shows identification and provision of a list of entities (such as names associated with a document) and relations among entities in a document. A document is received at 715, and documents are downloaded at 720. Heuristics for identifying entity names and relations among entities (e.g., for person names that may include recognizing titles, capitalization, position and function in the sentence, etc.), combined with lexicon lookups, are then applied to identify entity names and relations in the document at 725. A list of entity names and relations is created at 730. At 735, links into the document corresponding to the entity names and relations are provided. In one embodiment, the list of extracted entities is displayed in a separate window, and each entity is supplied with navigational features, such as an up and down arrow to navigate to next and previous occurrences of the entity in the document. Information about the particular entity or entity relation may be extracted from additional resources at 740. For example, if the entity is a company name, appropriate information services providing information about such entities can be used to supply a link to the web site of the particular company. If the entity is a person name, the user may be able to access a person's web site using appropriate information services, or if the person is a publicly known figure, the latest information available from the press. Similarly, if two entities, for example a person with the name N and a company with the name C, are connected through the relationship “N is the President of C”, the system can provide the link to the pages where the person N is mentioned as the President of C. This feature may apply to a variety of entities, such as geographical features, countries, trademarks, etc., and typical or important relations among such entities. The list of entity names and relations with links is provided to the user at 745, and if the last document has been processed at 750, control is returned at 755. This process may be applied to a selected number of documents, or may continue in the background as long as is desired, or until the context is switched.
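The title and capitalization heuristics, together with the recording of occurrences for navigation, can be sketched as below. The two patterns are deliberately crude assumptions made for illustration (a short list of titles for person names, a short list of corporate suffixes for company names); they are not the heuristics or lexicons of the described facility, and the function name `extract_entities` is hypothetical.

```python
import re
from typing import Dict, List

# Assumed heuristic: a title followed by one or more capitalized words.
PERSON = re.compile(r"\b(?:Mr|Ms|Mrs|Dr|Prof)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")

# Assumed heuristic: capitalized words followed by a corporate suffix.
COMPANY = re.compile(r"\b([A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*\s+(?:Inc|Corp|Ltd)\.)")


def extract_entities(text: str) -> Dict[str, Dict[str, List[int]]]:
    """Map each entity type to its names and the character offsets where they occur.

    The offsets can back "next/previous occurrence" navigation in a viewer.
    """
    found: Dict[str, Dict[str, List[int]]] = {"person": {}, "company": {}}
    for kind, pattern in (("person", PERSON), ("company", COMPANY)):
        for match in pattern.finditer(text):
            found[kind].setdefault(match.group(1), []).append(match.start(1))
    return found
```

Heuristics of this kind over-match and under-match easily (for example, a capitalized sentence-initial word before a company name), which is one reason the text above combines them with lexicon lookups.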

In FIG. 8, a flow diagram indicated generally at 810 shows creation of a thumbnail of a document with highlighting. A next document is received through browsing or downloaded at 815 from the list of documents provided by a search engine. If the accessed document can be viewed as a single screen document (of some default size, for example), a thumbnail of the whole document is created. On the Web, the concept of a page is different from traditional paper documents. The size of a page can be a fixed size specified by the user or the system, or can be based on the size of the window used to view the document. For multi-page documents, the most relevant passages can be found at 820, and a thumbnail of the page containing the best passage is created at 825.

The thumbnail appears as a single sheet of paper and may either relate to the first page of a document, or some scaled version or abstract representation of the document. Larger documents may even be displayed as a stack of thumbnails with navigation there between. As an alternative, the thumbnail of multi-page documents can be created at 825 without identifying the most relevant passages, as represented by broken line 828. Instead, the thumbnail may be an abstract representation of the whole document in the form of a fixed length page partitioned into blocks that correspond to pages. The blocks can be colored to reflect the presence of important terminology in the particular part of the document. For example, the color of a particular block can be related to the color used to highlight the most prominent term in that part of the document. The result of this approach is a thumbnail filled with a spectrum of colored blocks that visualize the relevance of each part of the document.
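The block-coloring scheme for the abstract thumbnail can be sketched as follows. The sketch assumes that the document is already split into pages, that each term of interest has an assigned highlight color, and that the "most prominent" term in a page is simply the most frequent one; the function name `block_colors` and the default color are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Sequence


def block_colors(pages: Sequence[str], term_colors: Dict[str, str],
                 default: str = "white") -> List[str]:
    """Return one color per page: the highlight color of its most frequent term.

    term_colors maps lowercase terms to color names; pages that contain
    none of the terms get the default color.
    """
    colors: List[str] = []
    for page in pages:
        counts = Counter(
            w for w in re.findall(r"\w+", page.lower()) if w in term_colors
        )
        if counts:
            top_term = counts.most_common(1)[0][0]
            colors.append(term_colors[top_term])
        else:
            colors.append(default)
    return colors
```

Rendering the returned colors as stacked blocks would give the spectrum-like thumbnail described above.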

At 830, portions of the thumbnail corresponding to the most relevant passages are highlighted. Portions may also be highlighted without assessing the relevance of the passages. Links are then provided at 835 from the highlighted portions to the corresponding passages or portions of the document. The thumbnail is then displayed to the user at 840, and the process is repeated based on decision block 845 for a selected number of documents. Control is returned at 850.

In one embodiment, the thumbnail highlighting is based on the pattern matching of the query terms or interest profile terms without deeper linguistic analysis of the document text and identification of relevant passages. Generally, thumbnail highlighting can be done with respect to any information about the user's interest or information extracted from the document.

In FIG. 9, a flow diagram indicated generally at 910 shows creation of a summary of a document. A next document is received at 915, and the most relevant passages are identified at 920 as previously described, either with respect to the model, which may include the query (in the search mode) or interest profile (in the browsing mode), or independently from the current user's context. Selected passages are then extracted and assembled to form a summary at 925. In this embodiment, the summaries are created by extracting sentences from the text that contain prominent query terminology. The summary may also be limited to a predetermined length, with the most relevant passages or sentences being used first.
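Extraction of sentences up to a predetermined length, most relevant first, can be sketched as below. The sketch makes several assumptions that are not in the text: sentences are split on terminal punctuation, relevance is the count of query terms in a sentence, the length limit is measured in characters, and the chosen sentences are shown in document order. The function name `summarize` is hypothetical.

```python
import re
from typing import List, Sequence


def summarize(text: str, terms: Sequence[str], max_chars: int = 200) -> str:
    """Build an extractive summary from the sentences richest in query terms."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    term_set = {t.lower() for t in terms}

    def score(sentence: str) -> int:
        return sum(1 for w in re.findall(r"\w+", sentence.lower()) if w in term_set)

    # Most relevant sentences first; the stable sort keeps document order on ties.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen: List[int] = []
    used = 0
    for i in ranked:
        if score(sentences[i]) == 0:
            break  # remaining sentences contain no query terminology
        extra = len(sentences[i]) + (1 if chosen else 0)  # +1 for the joining space
        if used + extra > max_chars:
            continue  # does not fit; a shorter sentence may still fit
        chosen.append(i)
        used += extra
    return " ".join(sentences[i] for i in sorted(chosen))
```

Keeping the index of each chosen sentence would also allow links from the summary back to the corresponding portions of the document.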

Portions of the summary are highlighted at 930, and links are created therefrom to corresponding portions of the document at 935. The summary is then displayed to the user at 940, and further documents are processed in the same manner based on decision block 945. Control is returned at 950.

Conclusion

A highlighting facility on a computer provides information to a user to independently assist the user in evaluating the relevance of documents identified by a search engine or some other information providing service in response to a user query, or the relevance of documents accessed in a browsing mode in relation to a particular user's interest. When accessing documents identified as relevant by the information providing service or in the browsing mode from other networked computers, the facility determines why a document may be of interest, and provides information or highlighting to assist the user in determining whether the document is desired.

An important characteristic of the Web is a separation of data gathering and indexing from information delivery and presentation. The information highlighting facility deals with the presentation and information highlighting of documents to facilitate reading, comprehension, and assimilation of information found in the accessed documents. Information highlighting is independent of the search, and thus searches from multiple different search engines can be relevance assessed and ranked together in a consistent manner. By providing the highlighting based on actual retrieved documents, up to date versions of the documents are assured. The facility may base relevancy of a retrieved document on the original query, or a model of the user's interest, which may include an augmented set of search terms or enhanced version of the query which takes into account the general interest of the user as captured by an interest profile and context of use of the computer by the user, or a combination thereof. This provides a consistent and enhanced ability to correctly identify relevance of each document, rather than rely on the search engine basing relevance purely on a single query.

Linguistic analysis and semantic expansion to provide the augmented version or set of terms is done in parallel with the execution of the query by one or more search engines to provide relevance more quickly. The model of the user's interest is then applied by the facility to documents as they are accessed through a browser to provide highlighting of relevant portions of the document. The model can be thought of as an interest profile context, or representation of the user's information need. When browsing the web within this context or session, the corresponding terms are highlighted in the accessed documents.

The facility may also be run as a remote service on a powerful computer (in contrast to the possibly less powerful local computers used by the user) to further speed up processing and minimize delays. The remote service computer may in fact have a much higher bandwidth connection to the network, and be able to process many documents while the user is still considering the list of documents returned by the search engine or some other information providing service.

Documents may be scrolled to the most relevant portion of a multi-page document based on pattern matching of the document text with the query or interest profile terms or by relevance scoring of individual paragraphs or subparts of the document based on the model. The list of documents provided by one or more search engines may also be re-ranked based on relevance ranking and based on a representation of the user's need. The re-ranking may be based on summaries provided by the search engines, or by actually retrieving the documents and either pattern matching with the augmented terms or performing a deeper linguistic and statistical analysis of the document text, or based on the model and assessing the document relevance to the query.

Information, such as names of entities (e.g., a person's or a company's name) and the relations among the entities, may be extracted using well known heuristics and lexicon lookups, and provided as a list, linked back into the document. For such names and relations, external links can also be found by local lookup or query and provided to the user. Further, based on the downloaded documents, thumbnails of the documents may be created with highlighting corresponding to the most relevant portions of the documents. Links to the document are provided within the thumbnail based on the highlighting or discrete portions within the thumbnail corresponding to the relevant portions of the document. The thumbnail provides a visual representation of the relevance of the entire document and allows the user to quickly identify an area of the document to help determine its relevance.

A summary of the document text can be provided by extracting salient sentences from the text as identified by pattern matching with the augmented terms or a deeper linguistic and statistical analysis of the document text, or based on the model. Summaries can also be generated based on important entities and entity relations found in the text, related to or independent from the current user's interest or query context. In a browsing mode, the internal and external links on a web page currently viewed can be assessed by downloading the text of linked documents in the background and assessing their relevance to the user's need and interest. Such information may be communicated to the user as an aid in deciding whether or not to follow the links.

These different ways of providing relevance information can be divided into categories based on whether they require analysis of the target documents or not. Some can be effectively implemented based on a very shallow analysis of the document text, practically by pattern matching without having to access the document in advance. These include highlighting, scrolling, and thumbnail creation and highlighting. Some ways are better implemented by downloading the document text and providing a deeper linguistic analysis of the text. These include more sophisticated document highlighting, scrolling and thumbnail highlighting, entity extraction and entity relation finding, summarization of documents, re-ranking of the retrieved documents, and assessment of hyperlinks in the documents.

The model of the user's interest may also vary across a broad spectrum from simple to more detailed. The user's original description of the query may be used in one embodiment. Further variations include using the augmented query, an original description of the interest profile, an enhanced description of the interest profile, general interest profiles which are not user specific but are selected from some topical hierarchy (a library of topic profiles), and a query or interest profile combined with information about the user's task.

In the present invention, document presentation and document analysis features within a distributed computer network environment are provided where document gathering, indexing and relevance assessment with respect to a user's query are independent from document delivery and presentation to the user. The user's need is separated from the search strategy. In other words, the user's query and interest profile are modeled independently from search activities, such as by applying linguistic analysis. Further, support for relevance assessment is provided in both the search and browsing modes. The user interest model is applied to view and analyze documents that are accessed as a result of the search activity or by browsing Web documents.

This application is intended to cover any adaptations or variations of the present invention. It is manifestly intended that this invention be limited only by the claims and equivalents thereof.

1. A computer implemented method of displaying documents accessed in a search or browsing mode, the method comprising: creating a model of a user's interest; accessing documents from a source of such documents; applying the model of the user's interest to the retrieved documents; and generating information regarding the relevancy of the retrieved documents.
2. The method of claim 1 wherein the model comprises a query which is enhanced based on linguistic analysis.
3. The method of claim 2 wherein the linguistic analysis comprises syntactic and semantic analysis.
4. The method of claim 1, wherein the model comprises a query which is enhanced based on a general interest profile.
5. The method of claim 4, wherein the general interest profile is applied equally to documents accessed by the user in both search and browsing modes.
6. The method of claim 1, wherein the model of user interest is based at least partially on the user task.
7. The method of claim 1, wherein the information is used to highlight relevant portions of text in the retrieved documents.
8. The method of claim 1, wherein the model comprises a query which is enhanced independently of and during the execution of the query by the search engine.
9. The method of claim 1, wherein the model comprises a query which is applied to the accessed documents to assess relevance during retrieval of documents from their sources.
10. The method of claim 1, wherein documents are retrieved while a user that generated a query may be performing other tasks.
11. A computer readable medium having instructions stored thereon that causes a computer to perform the method of claim 1.